We propose Barnacle, an assembly algorithm for sequences generated by the clone-based
approach, and illustrate it by assembling the human genome. Our method abandons the
original physical-mapping-first framework. As we show, Barnacle more effectively resolves
conflicts due to repeated sequences, which are the main difficulty of the sequence assembly
problem. In addition, we are able to detect inconsistencies in the underlying data. We present
our results on the December 2001 freeze of the public working draft of the human genome and
compare them with NCBI's assembly (Build 28).

The assembly of the December 2001 freeze of the public working draft generated by Barnacle
and the source code of Barnacle are available at http://www.cs.rutgers.edu/~vchoi.



Much of the attention and debate within the Human Genome Project has focused
on two sequencing strategies: whole-genome shotgun sequencing, adopted by Celera Genomics
(Venter et al. 2001), vs. hierarchical shotgun sequencing, also referred to as clone-based
or BAC-by-BAC, used by the International Human Genome Sequencing Consortium
(IHGSC) (International Human Genome Sequencing Consortium 2001). The data set generated
by the whole-genome shotgun approach consists of a set of pairs of sequence reads (mates). Each
read has an average length of around 500 bp, governed by current sequencing technology. The data
set generated by the clone-based approach consists of a set of Bacterial Artificial Chromosome
(BAC) clones, typically 100-200 Kb. We will use 'BAC', 'clone' and 'BAC clone' interchangeably.
Each BAC clone is individually sequenced and assembled. In general, a clone consists of a set of
more or less non-overlapping fragments. A clone is called a finished clone if it consists of only
one fragment which has accuracy at least 99.99%; otherwise it is a draft clone.

Corresponding author. Email: vchoi@cs.duke.edu, Fax: (919)660-6519.

The computational problem of reconstructing the genome sequence by assembling the sequences 
of the data set is called the sequence assembly problem. The challenge of the problem lies in 
the repeat-rich nature of the genome. For example, more than half of the human genome consists
of repeated sequences, which include large 50-500 Kb duplicated segments with high sequence
identity (> 98%) (International Human Genome Sequencing Consortium 2001). Computationally, the
sequence assembly problems for the two sequencing strategies are completely different because the
formats of the data sets are different. 

In this paper, we focus on the assembly problem of sequences generated by the clone-based 
approach, with application to the human genome. In addition to the repeat problem mentioned 
above, the assembly problem of the clone-based approach is further complicated by laboratory 
errors, such as chimeras and contaminations; by the draft quality of the sequences, which are 
sometimes inaccurate; and by polymorphism. 

The set of BAC clones, finished and draft, produced by IHGSC is called the public working
draft of the human genome. It is worthwhile to point out that the clone-based strategy
used is not the originally proposed "clone by clone" strategy, in which a physical map
giving the clone ordering is first produced, and then a minimum tiling set of clones in the
map is selected and individually subjected to shotgun sequencing. In practice, the physical
map of the clones based on fingerprint overlaps is constructed concurrently with the sequencing
(International Human Genome Sequencing Consortium 2001). The ordering of the sequenced
clones is either unknown or not accurately known.

There are two other algorithms for assembling the public working draft of the human genome.
One is GigAssembler (Kent and Haussler 2001) from UC Santa Cruz and the other is NCBI's
unpublished algorithm (http://www.ncbi.nlm.nih.gov/genome/guide/human/). The fingerprint-based
physical map, which was constructed by manually editing the initial automated fingerprint
assembly (The International Human Genome Mapping Consortium 2001), was employed by
GigAssembler to assist in the sequence assembly, although the clone ordering was only roughly
determined in the map (Kent and Haussler 2001). NCBI does not make use of the fingerprint map
(http://www.ncbi.nlm.nih.gov/genome/guide/HsFAQ.html#diffassemblies). (Remark: NCBI's
algorithm was said to be modified to use the hand-curated Tiling Path Files since Build 29.) Instead,
several STS marker maps were used for chromosome assignments. Both algorithms assemble
sequences greedily: the best overlap is assembled first, where "best" is defined by score
functions that assign weights to overlaps. Since the genome is highly repetitive, these score
functions are unlikely to yield the true assembly (Bailey et al. 2001). Without fingerprint clone
contigs to restrict the placement of BACs, the NCBI algorithm first formulates a Maximal Interval
Subgraph (MIS) problem to obtain a BAC ordering and then assembles the sequences of overlapping
BACs. This approach is natural given the history of the Genome Project, with its original
physical-mapping-first approach. However, this top-down approach suffers from the fact that if a
BAC is misplaced, the mistake in the assembly is unrecoverable.

Other assemblers, such as ARACHNE (Batzoglou et al. 2002), are not comparable because the
formats of their input data are different. An assembler for the clone-based approach can be
viewed as a second-stage assembler, in which the input fragments are the preassemblies of the
shotgun reads of each BAC clone produced by some assembler, such as PHRAP (Green, unpublished).

In this paper, we propose an assembler, Barnacle, for the clone-based sequences augmented
with chromosome assignments; thus we use the same input data as NCBI. Part of the novelty of our
approach is to assemble the genome bottom-up, without making use of the fingerprint-based physical
map of the clones. That is, we use extensive sequence-level overlaps to suggest BAC overlaps.
As we will explain in the discussion section, the clone-overlaps inferred from the sequences are
more informative than those inferred from the fingerprint data. In other words, this approach
allows us to order BACs with higher confidence. Also, unlike the use of score functions to attempt
to resolve the repeat problem, our way of resolving the repeat problem by considering the
consistency of overlaps and the interval graph formalism is mathematically justifiable. In
addition, this enables us to detect inconsistencies in the underlying data. We remark that this
error detection feature does not exist in the other algorithms. Finally, we remark that the idea
of our algorithm can be naturally extended when additional information, such as low-copy repeated
sequences, is available. A separate work (Choi et al. 2002a) on modifying Barnacle to incorporate
a segmental duplication database (Bailey et al. 2002) is in preparation (Choi et al. 2002b).



Details of Input 

In this section, we describe the input for Barnacle, which includes the sequence information of 
BACs and fragments, overlap information and orientation information. 

BACs and fragments 

A BAC consists of a contiguous stretch of DNA from a chromosome. The data associated with each
BAC consist of the estimated length of the BAC, the phase of the BAC, the number of fragments,
each fragment's sequence and the chromosome of the BAC, if assigned.

Table 1 shows an example: BAC AC002092.1, with estimated length 95456 bp, has four
fragments. Fragment AC002092.1~1 is the sequence from 1 to 888 in the GenBank record of
AC002092.1. Notice that we do not know the actual sequence of each BAC but we know the
sequence of each of its fragments, which are the preassemblies of the shotgun reads of the BAC
produced by some assembler, such as PHRAP (Green, unpublished).

Depending on the raw data coverage, BACs are categorized into three phases. Phases 1 and 2
are the draft sequences; phase 3 comprises the finished sequences. For a phase 1 BAC, the fragments
of the BAC are not necessarily disjoint, and the order and orientation of fragments are in general
unknown. For a phase 2 BAC, the fragments are disjoint and the order of fragments is known. A
phase 3 BAC is a finished sequence, i.e. only one fragment. In addition, for some phase 1 BACs,
some partial fragment order information and end fragment information are also available.
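
As an illustration of the record described above, the following is a minimal sketch of how a BAC and its fragments might be represented. The class and field names are our own assumptions for illustration and are not taken from the Barnacle source code; the values are those of the example BAC AC002092.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fragment:
    name: str   # hypothetical naming: "<accession>~<index>", e.g. "AC002092.1~1"
    start: int  # 1-based start coordinate in the GenBank record
    end: int    # inclusive end coordinate in the GenBank record

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class BAC:
    accession: str
    estimated_length: int              # in bp
    phase: int                         # 1 or 2 = draft, 3 = finished
    chromosome: Optional[str] = None   # None if no chromosome is assigned
    fragments: List[Fragment] = field(default_factory=list)

    @property
    def is_finished(self) -> bool:
        # A finished (phase 3) BAC has exactly one fragment.
        return self.phase == 3 and len(self.fragments) == 1


example_bac = BAC(
    accession="AC002092.1",
    estimated_length=95456,
    phase=1,
    chromosome="17",
    fragments=[
        Fragment("AC002092.1~1", 1, 888),
        Fragment("AC002092.1~2", 889, 46200),
        Fragment("AC002092.1~3", 46201, 84925),
        Fragment("AC002092.1~4", 84926, 95170),
    ],
)
```

The fragment sequences themselves are omitted here; only the coordinates needed to reproduce the lengths in the example are kept.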

Overlap Information 

Based on local alignments of all fragment sequences against all fragment sequences, we further
process these alignments into two types of valid overlaps between fragments: dovetail overlap and
complete containment (see Fig. 1). The sequence identity of the overlaps has to be at least 97%,
and the end-allowed-error (as shown in Fig. 1) is taken to be 350 bp for finished sequences and
min{10% of the fragment length, 1000} bp otherwise.
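
The thresholds above can be written down as a small sketch. The function names are hypothetical, and rounding 10% of the fragment length down to a whole number of base pairs is our assumption; the text does not say how fractional values are handled.

```python
MIN_IDENTITY = 0.97  # overlaps need at least 97% sequence identity


def end_allowed_error(fragment_length: int, finished: bool) -> int:
    """Maximum unaligned overhang (bp) tolerated at a fragment end."""
    if finished:
        return 350
    # min{10% of the fragment length, 1000} bp; integer division is an assumption
    return min(fragment_length // 10, 1000)


def passes_thresholds(identity: float, unaligned_end: int,
                      fragment_length: int, finished: bool) -> bool:
    """True if an alignment meets both the identity and the end-error thresholds."""
    return (identity >= MIN_IDENTITY
            and unaligned_end <= end_allowed_error(fragment_length, finished))
```

For example, a draft fragment of 888 bp tolerates an 88 bp end error under this sketch, while a draft fragment of 45312 bp is capped at 1000 bp.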

In addition, some overlaps are generated from the annotation of some finished sequences.
These are called nt-pairs. They include 0 bp overlaps, i.e., the fragments do not overlap
but are consecutive to each other. The nt-pairs are taken to be true overlaps, provided that the
annotations are correct and the pairs are generated accordingly.

Orientation Information 

There are additional sequences from paired-end plasmid reads. These sequences are aligned against 
all fragment sequences. Based on these alignments, the relative orientation of fragment pairs is 
obtained as input. 

Results and the comparison with the public assembly

In this section, we present the result of our algorithm applied to the December 2001
freeze of the public working draft of the human genome. First, we describe the details of the input
of the December 2001 freeze. Then we present our results and compare them with NCBI's assembly
(Build 28).

Input of the December 2001 freeze 

The sequence information with chromosome assignments, the local alignments of all-against-all
fragment sequences, and the local alignments of all fragment sequences against plasmid read
sequences were provided by NCBI. We further processed these local alignments to generate valid
fragment pair overlaps and the relative orientation of fragment pairs. The details of the December
2001 freeze are shown in Table 2.

Results on the December 2001 freeze 

We assemble 219,031 non-singleton fragments into 12,726 subcontigs, with length 2.51 Gbp. From
these subcontigs, we obtain a clone graph of 33929 BACs in 2195 connected components, which
consists of 23145 BACs in 2052 interval connected components and 1078 BACs in 143 non-interval
connected components. Upon resolving non-interval components, 139 problematic BACs, which
either are suspicious chimeras or contain unidentified repeats, are removed. The result is 33722
non-singleton BACs in 2443 interval components, as shown in Table 3. We remove 8189 (= 259230 -
251041) fragments based on our FN detection, which indicates that these fragments should be
contained in some subcontigs.

A graphical display of the assembly of the December 2001 freeze of the working draft of the human
genome is available at (http://www.cs.rutgers.edu/~vchoi).

Comparison with NCBI's assembly 

We compare our results with NCBI's assembly of the December 2001 freeze (see Table 3). A total of
29537 (= 251928 - 222391) fragments were absent from NCBI's build for reasons we do not know. To
further measure the quality of the assembly, we introduce the definition of
$\mathrm{warp} = \frac{\text{assembled BAC length}}{\text{estimated BAC length}}$,
which is the ratio of the assembled BAC length over the estimated BAC length. See Table 3 for
the comparison. There are 1016 BACs in our assembly with assembled BAC length more than
250 Kbp (the longest one is 732,527 bp), while there are 3786 such BACs (with the longest one
19,436,525 bp) in NCBI's assembly. We conjecture that the number would be even worse if they
had not thrown away the 29K fragments. Note that while it is true that the warped BACs are
misassembled, it does not follow that the non-warped BACs are correctly assembled. Indeed, the
suspicious chimeric BACs we detected are usually non-warped in NCBI's assembly because of the
way their assembly is done. To measure the accuracy of an assembly, it is important to understand
the method or the algorithm of the assembly.
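
To make the warp measure concrete, here is a minimal sketch. The function names are hypothetical. The example uses the lengths quoted for BAC AC019248 in Figure 9: estimated length 147259 bp, assembled length 147725 bp in the clone information box of the figure, and 1,556,292 bp in NCBI's Build 28 according to the legend. The 250 Kbp cutoff is the one used in the comparison above.

```python
OVERLONG_BP = 250_000  # assembled BACs longer than this are counted as warped above


def warp(assembled_length: int, estimated_length: int) -> float:
    """Ratio of assembled BAC length to estimated BAC length."""
    return assembled_length / estimated_length


def is_overlong(assembled_length: int) -> bool:
    """True if the assembled length exceeds the 250 Kbp cutoff."""
    return assembled_length > OVERLONG_BP


estimated = 147259
warp_figure9_box = warp(147725, estimated)   # close to 1: not warped
warp_build28 = warp(1556292, estimated)      # above 10: clearly warped
```

A warp near 1 is necessary but, as noted above, not sufficient for a correct assembly of the BAC.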



Discussion 

Our algorithm is very efficient. It takes 3 minutes on a Pentium III (933 MHz) computer to assemble
the public working draft of the human genome after preprocessing to the input format as described.
For the same step, GigAssembler takes about two hours on a cluster of 100 Pentium III CPUs. For
NCBI's assembler, we only know that it takes less than one day; we do not know the actual time. Most
importantly, unlike the other two algorithms, Barnacle is based on a mathematically justifiable
approach for assembling the sequences (see section Methods for details). The algorithm not only
can detect inconsistencies in the underlying data but also can be well adapted to many situations
which may arise as the technology changes and more data become available (for example, see
(Choi et al. 2002b)).

Also, we would like to note that another advantage of our algorithmic approach is that we make
better use of the available data. As was also pointed out in (Myers 1999), the clone-overlaps
inferred from the sequences are more informative than those inferred from the fingerprint data.
While the NCBI algorithm does not make use of the fingerprint map as GigAssembler does, the way
NCBI makes use of the fragment overlaps to infer the clone overlaps is in the traditional, less
informative physical-mapping clone overlap fashion. More precisely, in physical mapping, one only
knows whether two clones overlap by some common fingerprints, rather than at which ends they
overlap, while such information is explicit in the fragment overlaps. See Figure 2 for this
important distinction. In particular, NCBI formulates a Maximal Interval Subgraph (MIS) problem
(which is NP-hard) to obtain a clone ordering, where the weight of a clone-overlap is the
summation of weights of all the corresponding fragment pairs, which are assigned by some score
functions. In contrast, we first "conservatively" assemble fragments, which uses the full
information of fragment overlaps (and not just whether two fragments overlap), and then infer the
clone overlaps from these subassemblies.

It is worthwhile to point out that it is theoretically impossible to recover the true assembly
based only on the sequence overlap information, although the way we resolve the repeat problem
is justified by considering the consistency of overlaps and the interval graph formalism. Indeed
the assemblies are far from perfect. For example, the assemblies of those "warped" BACs whose
assembled length is greater than 250 Kb are obviously incorrect. These misassemblies might be
largely due to undetected FNs, or they might be due to over-collapse of repetitive regions that do
not destroy the interval graph property; for example, repeats which occur at one end of contigs due
to the lack of coverage, or repeats which are longer than an entire BAC. More biological information
is needed to resolve these cases. For instance, the 500 Kb inverted repeat on chromosome Y is
resolved by using the annotation of sequences in GenBank. Also, the detected suspicious
errors, which include chimeras, chromosome misassignments, wrong annotation of some finished
sequences and fragment misassemblies, need further follow-up and verification by the sequencing
centers. We believe that in order to achieve a high-quality assembly, an iterative process involving
collaboration with the sequencing centers is necessary.

An original goal of the Human Genome Project is to provide an accurate reference sequence of
the euchromatic portions of all human chromosomes (Collins et al. 1998), and it is essential for an
assembler to resolve repeats correctly in order to achieve accurate results (Eichler 2001). It is our
hope that future (automated) assemblers for clone-based sequences of whole genomes (including
other higher organisms, such as mouse, maize and rice) will be developed and will improve on
this well-justified framework. Since the Human Genome Project, new sequencing strategies have
been developed, e.g., hybrids of clone-based and whole-genome shotgun sequencing. Nevertheless, we
believe researchers can benefit from these algorithmic ideas when developing new assemblers. The
source code of Barnacle is freely available at (http://www.cs.rutgers.edu/~vchoi).



Methods 

Before we describe the idea behind our algorithm, we introduce some terminology. 

We say an overlap is true if the fragments are from the overlapping segments of the genome,
otherwise the overlap is repeat-induced (Fig. 3) (Myers et al. 2000). The overlaps which are not
true are called False Positives (FPs). In other words, repeat-induced overlaps are FPs. On the
other hand, the overlaps which are true but not detected are called False Negatives (FNs). FPs
and FNs are called noise.

The assembly problem would be straightforward if we could divine all true overlaps, i.e., if the 
data of overlaps were noise-free. The key objective is thus to clean up the noise as much as possible 
and assemble the fragments according to the true overlaps. 

Observe that for an overlap to be true, it is necessary for the overlap to be "consistent" (Fig. 4).
However, "consistency" is not sufficient for a true overlap. There might be some consistent but
repeat-induced overlaps, which are due to either the lack of coverage or long repeats (Fig. 5). We
call a consistent but repeat-induced overlap a consistent repeat-induced overlap, otherwise we call
it an inconsistent repeat-induced overlap.
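
The notion of consistency can be sketched as follows, under simplifying assumptions that are ours and not the paper's: all fragments are in the same orientation, and each overlap with a pivot fragment is summarized by the offset of the other fragment's start relative to the pivot's start. If the placements implied by two overlaps with the pivot force the two neighbours to share sequence, an overlap between them must also have been observed; otherwise the overlaps are inconsistent, as in the right panel of Figure 4. The function name, the data layout and the example coordinates are hypothetical.

```python
from itertools import combinations


def consistent_with_pivot(offsets, lengths, observed, min_shared=1):
    """Check the overlaps of several fragments with one pivot fragment.

    offsets:  fragment -> start offset relative to the pivot's start (bp)
    lengths:  fragment -> fragment length (bp)
    observed: set of frozenset({x, y}) for fragment pairs with a detected overlap
    """
    for x, y in combinations(offsets, 2):
        x_end = offsets[x] + lengths[x]
        y_end = offsets[y] + lengths[y]
        # number of bases the two implied placements have in common
        shared = min(x_end, y_end) - max(offsets[x], offsets[y])
        if shared >= min_shared and frozenset((x, y)) not in observed:
            return False  # implied overlap was never detected: inconsistent
    return True


# Pivot b overlaps a and c; the implied placements of a and c share 700 bp.
offsets = {"a": -600, "c": -300}
lengths = {"a": 1000, "c": 1000}
left_panel = consistent_with_pivot(offsets, lengths, {frozenset(("a", "c"))})
right_panel = consistent_with_pivot(offsets, lengths, set())
```

In the first call the overlap of a and c was detected, so the overlaps with the pivot are consistent; in the second it was not, so at least one overlap with the pivot must be repeat-induced.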

A contig (subcontig resp.) is a contiguous region that is covered by a set of overlapping BACs 
(fragments resp.). 

The basic idea behind our algorithm is illustrated in Figure 6, and Table 4 shows the high-level
description of the algorithm. In the following, we outline the idea of each step. The details of the
implementation of the algorithm are described in (Choi 2001) (http://www.cs.rutgers.edu/~vchoi).

"Conservatively" assemble fragments into subcontigs. First, we assemble consistent
overlapping fragments into subcontigs. Before we describe the algorithmic implementation, we
introduce some terminology. A fragment is called a subfragment if the fragment is
completely contained in another fragment, otherwise the fragment is maximal. Making use of
the maximality property of the latter, we efficiently identify and assemble consistent
overlapping maximal fragments (once again readers are invited to read (Choi 2001) for the simple
algorithmic trick). Then consistent overlapping subfragments are assembled into the contig
of their corresponding maximal fragments. With this implementation, 219,031 non-singleton
fragments of the December 2001 freeze are assembled into 12,726 subcontigs in about 25
seconds on a Pentium III computer.
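
The two-stage grouping described above can be sketched with a union-find structure. This is only an illustration of the grouping, assuming the consistent overlaps between maximal fragments and the containment pairs have already been computed; it is not the algorithmic trick of (Choi 2001), which is not reproduced here, and it does not compute the layout of fragments within a subcontig. All names are hypothetical.

```python
def group_into_subcontigs(maximal_overlaps, containments):
    """Group fragments into subcontigs.

    maximal_overlaps: iterable of (frag_a, frag_b), consistent overlaps
                      between maximal fragments
    containments:     iterable of (subfragment, maximal_fragment) pairs
    Returns a sorted list of sorted fragment-name lists, one per subcontig.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Stage 1: join maximal fragments that overlap consistently.
    for a, b in maximal_overlaps:
        union(a, b)
    # Stage 2: place each subfragment with its containing maximal fragment.
    for inner, outer in containments:
        union(inner, outer)

    groups = {}
    for frag in list(parent):
        groups.setdefault(find(frag), set()).add(frag)
    return sorted(sorted(g) for g in groups.values())


subcontigs = group_into_subcontigs(
    maximal_overlaps=[("a", "b"), ("b", "c"), ("x", "y")],
    containments=[("s", "b")],
)
```

Here fragments a, b and c form one subcontig that also receives the subfragment s, while x and y form a second subcontig.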

We then deduce clone-overlaps from these subcontigs: two clones overlap if and only if there
is at least one fragment pair of the corresponding clones overlapping in a subcontig. Then the
conflicting overlaps are resolved according to the clone-overlaps (Fig. 7). Unlike the use of
score functions to resolve conflicts, we make use of the BAC information of fragments
and use the assembly obtained from consistent overlaps to resolve inconsistent overlaps. This
approach is well justified because the consistent overlaps, which are a necessary condition for
true overlaps, give a good indication of whether two BACs overlap or not. Note that it is this
approach that allows us to naturally make use of the segmental duplication database without
substantially changing the algorithm.
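
The rule for deducing clone-overlaps can be sketched as follows. We assume, as in Table 1, that a fragment name is the clone accession followed by "~" and a fragment index; the function names are hypothetical. The example pairs are the consistent overlaps among fragments a, b and c of Figure 4.

```python
def clone_of(fragment_name: str) -> str:
    """Clone accession of a fragment named '<accession>~<index>' (assumed format)."""
    return fragment_name.split("~", 1)[0]


def clone_overlaps(fragment_pairs):
    """Clone pairs having at least one fragment pair overlapping in a subcontig.

    fragment_pairs: iterable of (frag_a, frag_b) that overlap within a subcontig
    Returns a set of frozenset({clone_a, clone_b}).
    """
    edges = set()
    for frag_a, frag_b in fragment_pairs:
        clone_a, clone_b = clone_of(frag_a), clone_of(frag_b)
        if clone_a != clone_b:  # overlaps within one clone give no clone edge
            edges.add(frozenset((clone_a, clone_b)))
    return edges


figure4_pairs = [
    ("AC020895.5~1", "AC010647.4~58"),
    ("AC010647.4~58", "AC027726.2~31"),
    ("AC020895.5~1", "AC027726.2~31"),
]
figure4_edges = clone_overlaps(figure4_pairs)
```

The resulting set of clone pairs is the edge set of the clone graph used in the next step.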

As mentioned before, these subcontigs might still contain some consistent repeat-induced 
overlaps. 
Detect and remove consistent repeat-induced overlaps and chimeric clones. We use
the linear structure of the chromosome to detect the consistent repeat-induced overlaps. The
linear structure of the sequence would be destroyed if the repeat-induced overlaps were used
(Fig. 8a). Mathematically speaking, the corresponding clone graph, whose vertices are clones
and in which there is an edge between two vertices if and only if the two corresponding clones
overlap, must be an interval graph. (Remark: This is true under the assumption that the gaps
within a BAC are short enough that there is at least one overlapping fragment pair whenever two
BACs overlap. However, this assumption no longer holds in the current working draft of the
human genome because some BACs were heavily trimmed, leaving some of them only
a few hundred bp long (much shorter than a possible gap). Comparing the April 2001 freeze
with the December 2001 freeze, we found 1200 BACs whose lengths had been reduced by more
than half. Note that our algorithm is robust enough that, with a slight
modification of the procedure for resolving non-interval graphs, one can heuristically take care of
this possibility.) Intuitively, imagine the genome sequence as a line; then each BAC is an interval
of the line such that two BAC sequences overlap if and only if the corresponding intervals
overlap (Fig. 9a). In other words, if the corresponding clone graph is not interval, the assembly
is incorrect. For example, BAC AC019248 in NCBI's Build 28 is misassembled (Fig. 9b). By
resolving the non-interval connected components of the clone graph, we detect and remove
suspicious repeat-induced overlaps and chimeric clones (Fig. 8). Instead of resorting to the
NP-hard Maximal Interval Subgraph problem, which does not characterize the real biology,
to get an interval graph, we design an efficient algorithm for resolving the non-interval graph
(see Table 5 for the idea). The interval representation (and hence the ordering) of clones is
obtained from the resulting interval clone graph by a linear-time interval graph recognition
algorithm (Corneil et al. 2001).
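
The geometric view can be illustrated with a short sketch. The first function builds a clone graph from known intervals; the second searches for an induced 4-cycle, which is one simple witness that a graph cannot be an interval graph, since interval graphs are chordal. This brute-force check is only a necessary condition chosen for illustration: it is neither the linear-time recognition algorithm cited above nor our procedure for resolving non-interval components, and it misses other obstructions such as longer induced cycles. The names and example data are hypothetical.

```python
from itertools import combinations, permutations


def clone_graph_from_intervals(intervals):
    """Adjacency sets of the graph with an edge wherever two intervals overlap.

    intervals: clone -> (start, end) on the genome line
    """
    adj = {clone: set() for clone in intervals}
    for a, b in combinations(intervals, 2):
        (s1, e1), (s2, e2) = intervals[a], intervals[b]
        if max(s1, s2) < min(e1, e2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def find_induced_4cycle(adj):
    """Return a chordless cycle (a, b, c, d) if one exists, else None."""
    for quad in combinations(sorted(adj), 4):
        a = quad[0]
        for b, c, d in permutations(quad[1:]):
            is_cycle = (b in adj[a] and c in adj[b]
                        and d in adj[c] and a in adj[d])
            has_chord = c in adj[a] or d in adj[b]
            if is_cycle and not has_chord:
                return (a, b, c, d)
    return None


linear = clone_graph_from_intervals(
    {"A": (0, 10), "B": (5, 15), "C": (12, 20), "D": (18, 25)}
)
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
```

A graph built from genuine intervals, such as `linear`, never contains such a cycle, whereas `square` cannot be laid out as intervals on a line, which is the kind of evidence that points to a repeat-induced overlap or a chimeric clone.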

Orient and order subcontigs. According to the interval representation of clones, subcontigs
are ordered and the orientation of "long" subcontigs is determined. According to the ordering
of clones, we orient subcontigs by flipping such that the ranks of BACs in each subcontig are
in non-decreasing order, and assign coordinates to subcontigs so that they can be ordered by
sorting according to these coordinates lexicographically. Note that our interval representation
of clones is quite informative in that most of the subcontigs can be ordered unambiguously
according to this reliable information. Thus we do not need to employ the quite noisy and
uncertain information from plasmid reads, ESTs and mRNAs to order the subcontigs, as
does GigAssembler, which resorted to the Bellman-Ford algorithm for testing the feasibility
of the information. Also, the interval representation allows us to detect FNs while
ordering subcontigs. First observe that the corresponding end BACs of adjacent subcontigs
must be either the same or overlapping (Fig. 10a). We call this necessary condition the
adjacency condition. Accordingly, when some subcontigs cannot be ordered such that they
satisfy the adjacency condition, it indicates that there might be FNs (Fig. 10b). To further
verify the identification of the FNs, we aligned the involved fragments with their overlapping
clones. Examining these alignments reveals several possible causes, which include the
consequences of repeat-masking, low accuracy of some draft sequences, chimeric fragments
or fragment misassemblies, and polymorphism. On the other hand, some subcontigs might
be ties, in which the adjacency condition is satisfied for every permutation of them. In
other words, the information is not sufficient to determine their order. In this case, we employ
some additional information to break these ties.
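
The adjacency condition can be sketched as follows. We assume, for illustration only, that each subcontig is already oriented and is summarized by the BACs at its left and right ends, and that clone-overlaps are given as a set of clone pairs. The brute-force search over permutations is our own device for showing how a violation can signal an FN; it is not how Barnacle orders subcontigs. All names are hypothetical.

```python
from itertools import permutations


def adjacent_ok(left_sub, right_sub, overlaps):
    """End BACs of neighbouring subcontigs must be the same or overlapping."""
    a, b = left_sub["right_end"], right_sub["left_end"]
    return a == b or frozenset((a, b)) in overlaps


def satisfies_adjacency(ordered, overlaps):
    """True if every neighbouring pair in the given order is acceptable."""
    return all(adjacent_ok(l, r, overlaps) for l, r in zip(ordered, ordered[1:]))


def any_valid_order(subcontigs, overlaps):
    """True if some ordering satisfies the adjacency condition."""
    return any(satisfies_adjacency(list(p), overlaps)
               for p in permutations(subcontigs))


s1 = {"name": "s1", "left_end": "B1", "right_end": "B2"}
s2 = {"name": "s2", "left_end": "B2", "right_end": "B3"}
s3 = {"name": "s3", "left_end": "B4", "right_end": "B5"}
known_overlaps = {frozenset(("B3", "B4"))}
```

With `known_overlaps`, the order s1, s2, s3 satisfies the condition. If the overlap between B3 and B4 had gone undetected, no ordering would satisfy it, which is the situation that suggests an FN.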

Adjust the ordering and correct the orientation of the subcontigs using additional
information. The additional information includes the identification of the end fragments of
BACs, as well as the partial order of some fragments. For each BAC which has end fragment
information, i.e., one or two end fragments, we first determine, according to their current
positions in the contig, which one is the left end fragment and which one is the right end fragment.
Then we orient the fragments so that they are the extreme fragments of the BAC. To ensure
the reliability of the information, these adjustments (order and orientation) are always subject
to the adjacency condition, i.e., we adjust the order and orientation according to
the information only if the adjacency condition is still satisfied. Similarly, the adjustments
are made for each group of ordered fragments.

Finally, the relative orientations of fragment pairs generated from plasmid reads are used to
orient the subcontigs which are not "long" enough for their orientation to have been determined.
According to the interval representation of clones, one can determine how confident the orientation
of a subcontig is. For example, for the subcontigs which are long enough that the smallest BAC
and the largest BAC do not overlap in the subcontig, the orientation is sure to be correct. The
remaining subcontigs are unsure. For each unsure subcontig, we organize all its relative
orientations into a list of agreeing subcontigs and a list of disagreeing subcontigs. We then
progressively change the orientation of unsure subcontigs such that the number of disagreeing
subcontigs is minimized.
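
One way to read this last step is as a greedy local search, sketched below under our own assumptions: orientations are +1 or -1, each relative orientation is a constraint saying that two subcontigs have the same or opposite orientation, and an unsure subcontig is flipped whenever more of its constraints disagree than agree. Each flip strictly reduces the total number of violated constraints, so the loop terminates, but it may stop at a tie or a local optimum. The names and data layout are hypothetical.

```python
def count_disagreements(orientation, constraints):
    """Number of constraints violated by the current orientations."""
    return sum((orientation[u] == orientation[v]) != same
               for u, v, same in constraints)


def orient_unsure(orientation, unsure, constraints):
    """Greedily flip unsure subcontigs to reduce disagreeing constraints.

    orientation: subcontig -> +1 or -1 (current orientation)
    unsure:      subcontigs whose orientation may be changed
    constraints: list of (u, v, same); same=True means u and v should
                 have the same orientation, False means opposite
    """
    orientation = dict(orientation)
    changed = True
    while changed:
        changed = False
        for s in unsure:
            mine = [c for c in constraints if s in (c[0], c[1]) and c[0] != c[1]]
            disagree = count_disagreements(orientation, mine)
            agree = len(mine) - disagree
            if disagree > agree:
                orientation[s] = -orientation[s]  # flipping swaps agree/disagree
                changed = True
    return orientation


start = {"S1": 1, "S2": 1, "S3": 1}           # S1 is sure; S2 and S3 are unsure
plasmid_constraints = [("S1", "S2", False), ("S2", "S3", False)]
result = orient_unsure(start, ["S2", "S3"], plasmid_constraints)
```

In this small example flipping S2 alone satisfies both constraints, so S3 keeps its orientation.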
Figure Legends

Figure 1 Two types of valid overlaps between a fragment pair. (1) dovetail overlap vs. (2) complete
containment. Because of the draft quality of sequences, end errors are allowed (shown as dashed
lines).

Figure 2 Sequence overlap vs. fingerprint overlap. (1) There are 3 sequence overlaps, but they
are incompatible. At least one of them is an FP. (2) Treating them as fingerprint overlaps (whether
two fragments overlap) would result in an incorrect interval representation of the 3 BACs.

Figure 3 True vs. repeat-induced overlaps. Consider the overlap of two fragments A and B. There
are two possible inferences: (1) The overlap is true, where the fragments are from the overlapping
segments of the genome. (2) The overlap is induced by repeated sequence R.

Figure 4 Consistent vs. inconsistent overlaps. Left, fragment b overlaps with fragments a and c as
shown, and a and c also overlap accordingly. The overlap of b and a is consistent. Right, fragment
g overlaps with fragments f and h as shown, but f and h do not overlap accordingly. The overlaps
of g and f, g and h, are inconsistent. At least one of them is repeat-induced.

Figure 5 Overlaps that are consistent but not true. There are two possibilities: (1) lack of
coverage: absence of both fragments L and R; (2) long repeats: the repeat is long enough such that
the overlaps are consistent.

Figure 6 The idea behind our algorithm. Conceptually, the algorithm consists of three steps. (A)
Fragments are "conservatively" assembled into subcontigs. The details of this step constitute steps
1-6 of the algorithm. (B) Consistent repeat-induced overlaps and chimeric clones will destroy the
interval property of the clone graph. Resolving this step corresponds to step 7 of the algorithm. (C)
According to the interval realization of clones obtained from the interval clone graph, subcontigs
are oriented and ordered in steps 8-10 of the algorithm. Remark: Steps 11-15 of the algorithm,
which use additional information to adjust the ordering and correct the orientation, are not shown
in this figure.

Figure 7 Resolving inconsistent overlaps. Since AC020954.6~1 has several consistent overlaps with
fragments of BAC AC027726.2, BAC AC020954.6 overlaps with BAC AC027726.2. The inconsistent
overlaps of AC020954.6~1 and AC008760.6~1, AC020954.6~1 and AC027726.2~33 are resolved
according to the clone-overlap. In this case, the overlap of AC020954.6~1 and AC027726.2~33 is
chosen, i.e. the overlap of AC020954.6~1 and AC008760.6~1 is repeat-induced.

Figure 8 a The result of collapsing a repeat region. The blue and green denote contigs covered
by a set of overlapping BACs, while the red represents a BAC consisting of 4 fragments. For
illustrative purposes, suppose the repeat region occurs in the two fragments of the red BAC. The
linear structure of the sequence would be destroyed if the repeat region were to be collapsed. b
Suspicious chimeric BACs. Left, when the problematic region occurs in the middle of a pair of
contigs, the BAC is most likely a chimera. Right, when the problematic region occurs at one end of
one contig, it is difficult to tell whether it is due to a repeat or whether the BAC is chimeric. To
ensure the quality, the BAC is removed.

Figure 9 a Geometric view of the sequences. The genome sequence is a line; each BAC corresponds
to an interval of the line. b Above, the contig of BAC AC019248 in NCBI's Build 28, which consists
of 10 fragments (only 9 fragments were used in NCBI's Build 28), is non-interval. Biologically, the
misassembly of BAC AC019248 is also evident by its assembled length of 1,556,292 bp. Below,
Barnacle's interval-graph assembly of the corresponding contig.

Figure 10 a The condition of adjacent subcontigs. The corresponding end BACs of the adjacent
subcontigs must be either the same or overlapping. b FN detection. No matter how we order the
subcontigs, the subcontig in the box will violate the adjacency condition. This is due to an FN, as
the arrow shows.
Figure 1: Panels: (1) dovetail overlap; (2) complete containment.
Figure 2: Panels: (1) 3 sequence overlaps; (2) one interval representation of the 3 BACs if the
overlaps were treated as fingerprint overlaps.
Figure 3: Panels: (1) true overlap; (2) repeat-induced overlap.
Figure 4: Panels: consistent overlaps (a = AC020895.5~1, b = AC010647.4~58, c = AC027726.2~31);
inconsistent overlaps (f = AC008760.6~1, g = AC020954.6~1, h = AC027726.2~33).
Figure 5: Panels: (1) lack of coverage; (2) long repeats.
Figure 6: Panels: A. "Conservatively" assemble fragments into subcontigs; B. Detect and remove
consistent repeat-induced overlaps and chimeric clones (Contig 1 and Contig 2, each with the
interval representation of its clones); C. Orient and order subcontigs (contig, clones, subcontigs).
Figure 7: [Image not reproduced. The image marks inconsistent and consistent overlaps among 
AC008760.6~1, AC020954.6~1, AC027726.2~33 and the fragments of AC027726.2, in two steps: 
(1) assemble consistent overlapping fragments into a subcontig; 
(2) resolve inconsistent overlaps according to clone-overlaps.] 
Figure 8: [Image not reproduced.] 
Figure 9: [Image not reproduced. Panel a: genome sequence and BACs. Horizontal axis: 1970 to 
2370 (kb). BAC tracks are labelled by accession, including AL358033.15, AL359816.16, AL355339.7, 
AL360230.20, AL353576.18, AC021041.6, AC021030.9, AC021033.5, AC021031.6, AC022282.4, 
AC073367.10, AC019248.3, AL365215.23, AC053510.3, AL596445.2, AC024605.6, AC067747.5, 
AC016235.3, AL158164.15, AL160289.14 and AL391334.16 (some labels are illegible).] 

Clone Information: 

Accession: AC019248.3 
Chromosome: 10 
Laboratory: WIBR 
Clone Name: RP11-115G24 
Phase: 1 

Estimated Length: 147259 bp 
Assembled Length: 147725 bp 






Figure 10: [Image not reproduced. Panel a: genome sequence, BACs and subcontigs, annotated 
"The corresponding end BACs of the adjacent subcontigs should be either the same or overlapping." 
Panel b: genome sequence, overlapping BACs and subcontigs.] 
accession #   estimated length   phase   chrm   number of fragments 
AC002092.1    95456              1       17     4 

fragment acc#   start   end     length 
AC002092.1~1    1       888     888 
AC002092.1~2    889     46200   45312 
AC002092.1~3    46201   84925   38725 
AC002092.1~4    84926   95170   10245 

Table 1: An example of a BAC. 
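
As a small worked check of the Table 1 record, the sketch below verifies that each fragment's length equals end - start + 1 and that the fragments tile the clone without gaps. It assumes 1-based inclusive coordinates, which is what the table's numbers imply; the `Fragment` class and `check_fragments` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    frag_id: str
    start: int   # 1-based, inclusive (assumed)
    end: int     # 1-based, inclusive (assumed)
    length: int

def check_fragments(fragments):
    """Return (total_length, problems) for fragments listed in clone order."""
    problems = []
    expected_start = 1
    for f in fragments:
        if f.end - f.start + 1 != f.length:
            problems.append(f"{f.frag_id}: length does not match start/end")
        if f.start != expected_start:
            problems.append(f"{f.frag_id}: gap or overlap before start")
        expected_start = f.end + 1
    return sum(f.length for f in fragments), problems

# The four fragment rows of Table 1.
table1 = [
    Fragment("AC002092.1~1", 1, 888, 888),
    Fragment("AC002092.1~2", 889, 46200, 45312),
    Fragment("AC002092.1~3", 46201, 84925, 38725),
    Fragment("AC002092.1~4", 84926, 95170, 10245),
]
total, problems = check_fragments(table1)
```

The fragment lengths sum to 95170 bp, slightly below the estimated clone length of 95456 bp listed in the table.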
1. Sequence Information 

phase   BACs    fragments   total length in Gbp   average number of fragments 
1       15298   246424      2.55                  16.11 
2       2154    8161        0.33                  3.79 
3       17624   17624       2.04                  1.00 
Total   35076   272209      4.992                 7.76 

2. Chromosome Assignment 

The scheme employed by NCBI to assign a chromosome to a BAC is based on: 

• STS: presence of at least 2 STS markers that have themselves been mapped to the same 
chromosome; 

• GenBank: annotation on the submitted GenBank record; 

Otherwise, the chromosome of the BAC is Unknown. 
The chromosome assignments are summarized as below: 

STS GenBank Unknown 
31543 2450 1083 

3. Overlap Information There are 403,466 overlapping fragment pairs in which both fragments are 
assigned to the same chromosome or at least one of them has an unknown chromosome. There are 12,656 nt-pairs. 

4. Orientation Information Relative orientation fragment pairs are generated from paired-end 
plasmid reads. There are 321,751 such fragment pairs. 

Table 2: The details of the input of the December 2001 freeze. 

                 BACs    Fragments Used/Fragments   Contigs   Length in Gbp 
singletons       1215    9967/9967                  1215      0.142 
non-singletons   33722   251041/259230              2443      2.708 

total            34937   261008/269197              3658      2.850 

Table 3: The results of Barnacle on the December 2001 freeze. 
                 BACs    Fragments Used/Fragments   Contigs   Length in Gbp 
singletons       836     9074/9074                  836       0.112 
non-singletons   32902   222391/251928              2292      2.745 

total            33738   231465/261002              3128      2.857 

Table 4: Build 28, NCBI's assembly of the December 2001 freeze. 



Warp         Our Assembly   NCBI's Public Assembly 
<= 1.5       33474          29647 
1.5 - 1.8    753            725 
1.8 - 2.0    278            371 
2.0 - 5.0    421            1813 
5.0 - 10.0   10             612 
> 10.0       1              570 

Restricting warp > 1.5, 

Assembled BAC Length   Our Assembly   NCBI's Public Assembly 
250K - 300K            434            461 
300K - 500K            549            1328 
500K - 800K            33             798 
800K - 1M              0              248 
1M - 2M                0              496 
2M - 3M                0              129 
3M - 10M               0              259 
10M - 20M              0              67 
Total                  1016           3786 

Table 5: Comparison of Barnacle's assembly with NCBI's assembly on the December 2001 freeze. 
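
The binning in Table 5 can be sketched as follows. This assumes that warp is the ratio of a BAC's assembled length to its estimated length, which is our reading and is not defined in this part of the text; the treatment of values falling exactly on a bin boundary is also an assumption. The example lengths are those given for BAC AC019248 in the Figure 9 caption and clone information.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the first five warp bins in Table 5.
EDGES = [1.5, 1.8, 2.0, 5.0, 10.0]
LABELS = ["<= 1.5", "1.5-1.8", "1.8-2.0", "2.0-5.0", "5.0-10.0", "> 10.0"]

def warp(assembled_bp, estimated_bp):
    """Assumed definition: assembled length divided by estimated length."""
    return assembled_bp / estimated_bp

def warp_bin(w):
    """Map a warp value to a Table 5 bin; boundary values go to the lower bin."""
    return LABELS[bisect_left(EDGES, w)]

def warp_histogram(length_pairs):
    """Count BACs per bin from (assembled_bp, estimated_bp) pairs."""
    return Counter(warp_bin(warp(a, e)) for a, e in length_pairs)

# BAC AC019248: estimated 147259 bp; assembled 147725 bp in the clone
# information of Figure 9, versus 1,556,292 bp reported for Build 28.
barnacle_bin = warp_bin(warp(147_725, 147_259))
build28_bin = warp_bin(warp(1_556_292, 147_259))
```

Under this reading the same clone falls in the lowest warp bin in one assembly and in the highest in the other, which is the kind of discrepancy the table summarizes.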
A. "Conservatively" assemble fragments into subcontigs. 

1. Classify fragments: singleton, subfragment and maximal fragment. 

2. Assemble consistent overlapping maximal fragments into subcontigs. 

3. Put back good subfragments to subcontigs. 

4. Detect and resolve conflicting chromosome assignments. 

5. Construct a BAC graph from subcontigs. 

6. Resolve inconsistent overlaps according to the connectivity of the BAC graph. 

B. Detect and remove consistent repeat-induced overlaps and chimeric clones. 

7. For each component G_i of the BAC graph, if G_i is not interval, resolve the 
component by removing repeat-induced overlaps or suspicious chimeric BACs. 

C. Orient and order subcontigs. 

8. Obtain the interval representation of BACs. 

9. Orient subcontigs. 

10. Assign coordinates to subcontigs and order subcontigs by sorting lexicographically. 

11. Detect the potential false negatives and remove the involved fragments. 

12. Detect false positives (consistent repeat-induced overlaps that do not destroy 
the interval property). 

D. Adjust the ordering and correct the orientation of the subcontigs using 
additional information. 

13. Adjust the order of the subcontigs according to the extra fragment information. 

14. Orient subcontigs according to the relative orientation of fragment pairs generated 
from plasmid reads, ESTs and mRNAs data. 

15. Derive a consensus sequence for each contig from the assembly of maximal 
fragments of the contig. 

Table 6: The high-level description of the algorithm. 
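
Steps 5 and 7 of the table can be sketched as below. This is an illustration under stated assumptions, not Barnacle's code: we assume a subcontig is a list of fragment identifiers of the form accession~index, and that two BACs are joined in the BAC graph when they contribute fragments to the same subcontig. The function names and the example identifiers are hypothetical.

```python
from itertools import combinations

def bac_of(fragment_id):
    """Strip the fragment index, e.g. 'X.1~3' -> 'X.1' (assumed naming)."""
    return fragment_id.split("~")[0]

def build_bac_graph(subcontigs):
    """Assumed rule: BACs sharing a subcontig are adjacent in the BAC graph."""
    adj = {}
    for fragments in subcontigs:
        bacs = sorted({bac_of(f) for f in fragments})
        for b in bacs:
            adj.setdefault(b, set())
        for a, b in combinations(bacs, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def components(adj):
    """Connected components of the BAC graph, each as a sorted list."""
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(sorted(comp))
    return result

# Hypothetical subcontigs built from hypothetical fragment identifiers.
subcontigs = [["X.1~1", "Y.2~3"], ["Y.2~4", "Z.1~1"], ["W.5~2"]]
graph = build_bac_graph(subcontigs)
parts = components(graph)
```

Each component returned here would then be tested for the interval property, as step 7 describes.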
In the following, we sketch the idea of how the non-interval graphs are resolved. The interested 
reader is invited to read (Choi 2001) for the details. 

Let $G = (V, E)$ be a non-interval graph. Without loss of generality, assume that $G$ is connected. 
Define a vertex $v \in V$ to be I-critical if $G|_{V \setminus \{v\}}$ is interval. 

Given a non-interval graph $G$, we first identify a forbidden subgraph of $G$, then check whether at 
least one of the vertices of the forbidden subgraph is I-critical. If so, we say $G$ is fixable. 
Based on the structure of the forbidden subgraph, a fixable graph G is resolved by 

1. adding an FN edge; or 

2. removing FP edges due to an identified repeat; or 

3. removing a vertex which is either a suspicious chimera or contains an unidentified repeat. 

For the non-fixable graphs, we employ a divide-and-conquer method by dividing the graph according 
to some articulation points such that each subcomponent is fixable. After fixing the subcomponents, 
the non-fixable graph would become fixable because the articulation point would become I-critical. 

Table 7: The idea of resolving non-interval graphs. 
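
The notion of an I-critical vertex can be illustrated with a brute-force sketch for small graphs. It does not follow the forbidden-subgraph procedure of the table; instead it uses the classical Lekkerkerker-Boland characterization (a graph is interval if and only if it is chordal and has no asteroidal triple) and simply tries deleting each vertex. Function names are ours, and the approach is only practical for small examples.

```python
from itertools import combinations

def induced(adj, keep):
    """Subgraph induced by the vertex set `keep`."""
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

def is_chordal(adj):
    """Chordal iff simplicial vertices can be removed until none remain."""
    g = {v: set(ns) for v, ns in adj.items()}
    while g:
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:
            return False
        for u in g.pop(simplicial):
            g[u].discard(simplicial)
    return True

def reachable(adj, src, banned):
    """Vertices reachable from src without entering `banned`."""
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in banned and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def has_asteroidal_triple(adj):
    """Three pairwise non-adjacent vertices, each pair joined by a path
    that avoids the closed neighbourhood of the third."""
    for a, b, c in combinations(adj, 3):
        if b in adj[a] or c in adj[a] or c in adj[b]:
            continue
        if all(y in reachable(adj, x, adj[z] | {z})
               for x, y, z in ((a, b, c), (a, c, b), (b, c, a))):
            return True
    return False

def is_interval(adj):
    return is_chordal(adj) and not has_asteroidal_triple(adj)

def i_critical(adj):
    """Vertices whose removal leaves an interval graph."""
    return sorted(v for v in adj if is_interval(induced(adj, set(adj) - {v})))

# A chordless 4-cycle is not interval; deleting any vertex leaves a path.
c4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
critical = i_critical(c4)
```

In this example every vertex of the 4-cycle is I-critical, so the graph is fixable in the sense of the table.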